 mixture-of-expert language model


Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks

Nakamura, Taishi, Ishikawa, Satoki, Kawamura, Masaki, Okamoto, Takumi, Nohara, Daisuke, Suzuki, Jun, Yokota, Rio

arXiv.org Artificial Intelligence

Empirical scaling laws have driven the evolution of large language models (LLMs), yet their coefficients shift whenever the model architecture or data pipeline changes. Mixture-of-Experts (MoE) models, now standard in state-of-the-art systems, introduce a new sparsity dimension that current dense-model frontiers overlook. We investigate how MoE sparsity influences two distinct capability regimes: memorization skills and reasoning skills. By training MoE families that vary total parameters, active parameters, and top-k routing under fixed compute budgets, we disentangle pre-training loss from downstream accuracy. Our results reveal two principles. First, Active FLOPs: models with identical training loss but greater active compute achieve higher reasoning accuracy. Second, Total tokens per parameter (TPP): memorization tasks improve with more parameters, while reasoning tasks benefit from optimal TPP, indicating that reasoning is data-hungry. Neither reinforcement learning post-training (GRPO) nor increased test-time compute alters these trends. We therefore argue that optimal MoE sparsity must be determined jointly by active FLOPs and TPP, revising the classical picture of compute-optimal scaling.

The recent evolution of large language models (LLMs) has been driven by empirical scaling laws (Hestness et al., 2017) that link training loss to model size, dataset size, and compute budget. Kaplan et al. (2020) showed that these laws hold across seven orders of magnitude, establishing them as a reliable extrapolation tool for dense Transformers. Subsequent work by Hoffmann et al. (2022) demonstrated that scaling curves can be inverted to choose the compute-optimal combination of parameters and tokens for a fixed budget. Together, these results have made scaling analysis a cornerstone of model planning at both academic and industrial labs.
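
The two quantities the abstract names, tokens per parameter (TPP) and MoE sparsity, reduce to simple ratios. The sketch below is illustrative only: the abstract does not say whether TPP is taken over total or active parameters, so the function accepts whichever count the caller supplies, and the sparsity definition (fraction of experts left inactive per token) is an assumption rather than the authors' stated formula. Function names and the numbers used are hypothetical.

```python
def tokens_per_parameter(num_tokens: float, num_params: float) -> float:
    """Training tokens divided by a parameter count (total or active, caller's choice)."""
    if num_params <= 0:
        raise ValueError("num_params must be positive")
    return num_tokens / num_params


def moe_sparsity(num_experts: int, top_k: int) -> float:
    """Assumed definition: fraction of experts inactive per token, 1 - top_k / num_experts."""
    if not 0 < top_k <= num_experts:
        raise ValueError("need 0 < top_k <= num_experts")
    return 1.0 - top_k / num_experts


# Hypothetical numbers, chosen only for illustration (not from the paper).
tpp = tokens_per_parameter(num_tokens=100e9, num_params=5e9)
sparsity = moe_sparsity(num_experts=64, top_k=8)
print(f"TPP = {tpp:.1f}, sparsity = {sparsity:.3f}")
```

Under a fixed token budget, adding parameters lowers TPP, which is the tension the abstract describes between memorization (helped by more parameters) and reasoning (helped by an optimal TPP).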


Megrez2 Technical Report

Li, Boxun, Li, Yadong, Li, Zhiyuan, Liu, Congyi, Liu, Weilin, Niu, Guowei, Tan, Zheyue, Xu, Haiyang, Yao, Zhuyu, Yuan, Tao, Zhou, Dong, Zhuang, Yueqing, Zhao, Bo, Dai, Guohao, Wang, Yu

arXiv.org Artificial Intelligence

We present Megrez2, a novel lightweight and high-performance language model architecture optimized for device-native deployment. Megrez2 introduces a cross-layer expert sharing mechanism, which significantly reduces total parameter count by reusing expert modules across adjacent transformer layers while maintaining most of the model's capacity. It also incorporates pre-gated routing, enabling memory-efficient expert loading and faster inference. As the first instantiation of the Megrez2 architecture, we introduce the Megrez2-Preview model, which is pre-trained on a 5-trillion-token corpus and further enhanced through supervised fine-tuning and reinforcement learning with verifiable rewards. With only 3B activated and 7.5B stored parameters, Megrez2-Preview demonstrates competitive or superior performance compared to larger models on a wide range of tasks, including language understanding, instruction following, mathematical reasoning, and code generation. These results highlight the effectiveness of the Megrez2 architecture in balancing accuracy, efficiency, and deployability, making it a strong candidate for real-world, resource-constrained applications.


Scaling Fine-Grained MoE Beyond 50B Parameters: Empirical Evaluation and Practical Insights

Krajewski, Jakub, Chochowski, Marcin, Korzekwa, Daniel

arXiv.org Artificial Intelligence

Mixture of Experts (MoE) architectures have emerged as pivotal for scaling Large Language Models (LLMs) efficiently. Fine-grained MoE approaches - utilizing more numerous, smaller experts - have demonstrated potential in improving model convergence and quality. This work proposes a set of training recipes and provides a comprehensive empirical evaluation of fine-grained MoE, directly comparing its scaling properties against standard MoE configurations for models with up to 56B total (17B active) parameters. We investigate convergence speed, model performance on downstream benchmarks, and practical training considerations across various setups. Overall, at the largest scale we show that fine-grained MoE achieves better validation loss and higher accuracy across a set of downstream benchmarks. This study offers empirical grounding and practical insights for leveraging fine-grained MoE in the development of future large-scale models.


Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models

Pan, Bowen, Shen, Yikang, Liu, Haokun, Mishra, Mayank, Zhang, Gaoyuan, Oliva, Aude, Raffel, Colin, Panda, Rameswar

arXiv.org Artificial Intelligence

Mixture-of-Experts (MoE) language models can reduce computational costs by 2-4$\times$ compared to dense models without sacrificing performance, making them more efficient in computation-bounded scenarios. However, MoE models generally require 2-4$\times$ more parameters to achieve comparable performance to a dense model, which incurs larger GPU memory requirements and makes MoE models less efficient in I/O-bounded scenarios like autoregressive generation. In this work, we propose a hybrid dense training and sparse inference framework for MoE models (DS-MoE) which achieves strong computation and parameter efficiency by employing dense computation across all experts during training and sparse computation during inference. Our experiments on training LLMs demonstrate that our DS-MoE models are more parameter-efficient than standard sparse MoEs and are on par with dense models in terms of total parameter size and performance while being computationally cheaper (activating 30-40% of the model's parameters). Performance tests using vLLM show that our DS-MoE-6B model runs up to $1.86\times$ faster than similar dense models like Mistral-7B, and between $1.50\times$ and $1.71\times$ faster than comparable MoEs, such as DeepSeekMoE-16B and Qwen1.5-MoE-A2.7B.


Fast Inference of Mixture-of-Experts Language Models with Offloading

Eliseev, Artyom, Mazur, Denis

arXiv.org Artificial Intelligence

With the widespread adoption of Large Language Models (LLMs), many deep learning practitioners are looking for strategies for running these models more efficiently. One such strategy is to use sparse Mixture-of-Experts (MoE), a type of model architecture where only a fraction of model layers are active for any given input. This property allows MoE-based language models to generate tokens faster than their dense counterparts, but it also increases model size due to having multiple experts. Unfortunately, this makes state-of-the-art MoE language models difficult to run without high-end GPUs. In this work, we study the problem of running large MoE language models on consumer hardware with limited accelerator memory. We build upon parameter offloading algorithms and propose a novel strategy that accelerates offloading by taking advantage of innate properties of MoE LLMs. Using this strategy, we can run Mixtral-8x7B with mixed quantization on desktop hardware and free-tier Google Colab instances.
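
The abstract does not spell out the offloading strategy, so the sketch below is not the authors' method. It only illustrates the general idea of parameter offloading for MoE layers: keep a small number of experts resident in accelerator memory and fetch the rest on demand. The least-recently-used eviction policy, the `ExpertCache` class, and the `load_fn` callback are all assumptions introduced for illustration.

```python
from collections import OrderedDict
from typing import Callable, Hashable


class ExpertCache:
    """Keeps at most `capacity` experts resident; evicts the least recently used."""

    def __init__(self, capacity: int, load_fn: Callable[[Hashable], object]):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.load_fn = load_fn  # hypothetical: fetches expert weights from host memory
        self._resident: OrderedDict = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, expert_id: Hashable) -> object:
        if expert_id in self._resident:
            self._resident.move_to_end(expert_id)  # mark as most recently used
            self.hits += 1
            return self._resident[expert_id]
        self.misses += 1
        weights = self.load_fn(expert_id)  # slow path: transfer from host memory
        self._resident[expert_id] = weights
        if len(self._resident) > self.capacity:
            self._resident.popitem(last=False)  # drop the least recently used expert
        return weights


# Toy routing trace; the string stands in for real expert weights.
cache = ExpertCache(capacity=2, load_fn=lambda i: f"weights-{i}")
for expert_id in [0, 1, 0, 2, 1]:
    cache.get(expert_id)
print(cache.hits, cache.misses)
```

Every miss corresponds to a host-to-device transfer, so a cache like this helps only to the extent that the router reuses recently selected experts.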